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ABSTRACT 

Motivated by the problems of computing sample covariance 
matrices, and of transforming a collection of vectors to a 
basis where they are sparse, we present a simple algorithm 
that computes an approximation of the product of two n- 
by-n real matrices A and B. Let ||AB||f denote the Frobe- 
nius norm of AB, and 6 be a parameter determining the 
time/accuracy trade-off. Given 2-wise independent hash 
functions hi,h,2 : [n] — > [b], and si,S2 : [n] — > { — 1, +1} the 
algorithm works by first "compressing" the matrix product 
into the polynomial 

n / n \ / n \ 

K*) = E(E A W^ Mi) J (J2 B *Mj)x h2U) ) ■ 

Using FFT for polynomial multiplication, we can compute 
Co, ... , c b _i such that £\ dx' = (p(x) mod x )+(p(x) div x b ) 
in time <D(n 2 + nb). An unbiased estimator of (AB)ij with 
variance at most ||A_B||f?/6 can then be computed as: 

C i3 = Sl(l) S 2 (j) C(h 1 (i) + h 2 (j)) mod 6 ■ 

Our approach also leads to an algorithm for computing AB 
exactly, whp., in time 0{N + nb) in the case where A and 
B have at most TV nonzero entries, and AB has at most 6 
nonzero entries. Also, we use error-correcting codes in a 
novel way to recover significant entries of ^473 in near-linear 
time. 

Categories and Subject Descriptors 

F.2 [ANALYSIS OF ALGORITHMS AND PROB- 
LEM COMPLEXITY]: Numerical Algorithms and Prob- 
lems; G.3 [PROBABILITY AND STATISTICS]; G.4 
[MATHEMATICAL SOFTWARE]; E.2 [DATA STOR- 
AGE REPRESENTATIONS]: Hash-table representations 
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1. INTRODUCTION 

Several computational problems can be phrased in terms 
of matrix products where the normalized result is expected 
to consist mostly of numerically small entries: 

• Given m samples of a multivariate random variable 
(X±, . . . , X„), compute the sample covariance matrix 
which is used in statistical analyses. If most pairs of 
random variables are independent, the corresponding 
entries of the sample covariance matrix will be concen- 
trated around zero. 

• Linearly transform all column vectors in a matrix B to 
an orthogonal basis A T in which the columns of B are 
approximately sparse. Such batch transformations are 
common in lossy data compression algorithms such as 
JPEG, using properties of specific orthogonal bases to 
facilitate fast computation. 

In both cases, an approximation of the matrix product may 
be as good as an exact computation, since the main issue is 
to identify large entries in the result matrix. 

In this paper we consider n-by-n matrices with real values, 
and devise a combinatorial algorithm for the special case of 
computing a matrix product AB that is "compressible". For 
example, if AB is sparse it falls into our class of compressible 
products, and we are able to give an efficient algorithm. 
More generally, if the Frobenius norm of AB is dominated 
by a sparse subset of the entries, we are able to quickly 
compute a good approximation of the product AB. Our 
method can be seen as a compressed sensing method for the 
matrix product, with the nonstandard idea that the sketch 
of AB is computed without explicitly constructing AB. The 
main technical idea is to use FFT JTTJ to efficiently compute 
a linear sketch of an outer product of two vectors. We also 
make use of error-correcting codes in a novel way to achieve 
recovery of the entries of AB having highest magnitude in 
near-linear time. 

Our main conceptual messages are: 

• It is possible to derive a fast and simple, "combinato- 
rial" algorithm for matrix multiplication in two cases: 
When the output matrix is sparse, and when an addi- 
tive error on each output entry is acceptable. 



• It is interesting to consider the use of compressed sens- 
ing techniques for computational problems where re- 
sults (or important intermediate results) can be rep- 
resented using sparse vectors. We outline some such 
targets in the conclusion. 



• The interface between theoretical computer science and 
statistics is an area where there is high potential for 
cross-fertilization (see also [2l] for arguments in this 
direction). 

1.1 Related work 

Matrix multiplication with sparse output. 

Lingas [20] considered the problem of computing a ma- 
trix product AB with at most b entries that are not trivially 
zero. A matrix entry is said to be trivially zero if every term 
in the corresponding dot product is zero. In general b can 
be much larger than the number b of nonzeros because zeros 
in the matrix product may be due to cancellations. Lingas 
showed, by a reduction to fast rectangular matrix multiplica- 
tion, that this is possible in time 0(n 2 6 188 ). Observe that 
for b = n 2 this becomes identical to the 0(n 2,376 ) bound by 
Coppersmith and Winograd [12] . 

Yuster and Zwick [25] devised asymptotically fast algo- 
rithms for the case of sparse input matrices, using a matrix 
partitioning idea. Amossen and Pagh 4' extended this re- 
sult to be more efficient in the case where also the output 
matrix is sparse. In the dense input setting of Lingas, this 
leads to an improved time complexity of O(n 1 724 b 40S ) for 
n<b<n 125 . 

Iwen and Spencer [l9] showed how to use compressed sens- 
ing to compute a matrix product AB in time 0(n 2+e ), for 
any given constant e > 0, in the special case where each 
column of AB contains at most n 29462 nonzero values. (Of 
course, by symmetry the same result holds when there is 
sparseness in each row.) 

All the results described above work by reduction to fast 
rectangular matrix multiplication, so the algorithms are not 
"combinatorial." However, Lingas [20] observed that a time 
complexity of 0(n 2 + bn) is achieved by the column-row 
method, a simple combinatorial algorithm. Also, replac- 
ing the fast rectangular matrix multiplication in the result 
of Iwen and Spencer 



19 



by a naive matrix multiplication 
algorithm, and making use of randomized sparse recovery 
methods (see [15]), leads to a combinatorial algorithm run- 
ning in time 0(n +nb) when each column oi AB has 0(b/n) 
nonzero values. 

Approximate matrix multiplication. 

The result of [19] is not restricted to sparse matrix prod- 
ucts: Their algorithm is shown to compute an approximate 
matrix product in time C(n 2+e ) assuming that the result 
can be approximated well by a matrix with sparse column 
vectors. The approximation produced is one with at most 
n 029462 nonzero values in each column, and is almost as 
good as the best approximation of this form. However, if 
some column of AB is dense, the approximation may differ 
significantly from AB. 

Historically, Cohen and Lewis [To] were the first to con- 
sider randomized algorithms for approximate matrix mul- 
tiplication, with theoretical results restricted to the case 
where input matrices do not have negative entries. Sup- 
pose A has column vectors a\, . . . , a n and B has row vectors 
bi ,...,&„ . The product of A and B can be written as a sum 
of n outer products: 



The method of Cohen and Lewis can be understood as sam- 
pling each outer product according to the weight of its en- 
tries, and combining these samples to produce a matrix C 
where each entry is an unbiased estimator for (AB)ij. If 
n 2 c samples are taken, for a parameter c > 1, the difference 
between C and AB can be bounded, whp., in terms of the 
Frobenius norrrQof AB, namely 

\\AB-C\\ F = 0(\\AB\\ F /Vc) . 



(This is not shown in 10 , but follows from the fact that 



each estimator has a scaled binomial distribution 
Drineas, Kannan, and Mahoney 
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showed how a sim- 
pler sampling strategy can lead to a good approximation of 
the form CR, where matrices C and R consist of c columns 
and c rows of A and B, respectively. Their main error 
bound is in terms of the Frobenius norm of the difference: 
\\AB - CR\\ F = C(||.4||f||£||f/v / c)- The time to compute 
CR using the classical algorithm is 0(n 2 c) — asymptoti- 
cally faster results are possible by fast rectangular matrix 
multiplication. Drineas et al. also give bounds on the ele- 
mentwise differences \ (AB — CR)ij\, but the best such bound 
obtained is of size Q,{M 2 n/y/c), where M is the magnitude 
of the largest entry in A and B. This is a rather weak bound 
in general, since the largest possible magnitude of an entry 
in AB is M 2 n. 

Sarlos [23] showed how to achieve the same Frobenius 
norm error guarantee using c AMS sketches [2] on rows of A 
and columns of B. Again, if the classical matrix multiplica- 
tion algorithm is used to combine the sketches, the time com- 
plexity is 0(n 2 c). This method gives a stronger error bound 
for each individual entry of the approximation matrix. If we 
write an entry of AB as a dot product, (AB)ij = hi ■ bj, the 
magnitude of the additive error is C? ( 1 1 5,^ 1 1 2 1 1 1 1 2 / with 
high probability (see [23| [l]). In contrast to the previous re- 
sults, this approximation can be computed in a single pass 
over the input matrices. Clarkson and Woodruff [8] further 
refine the results of Sarlos, and show that the space usage is 
nearly optimal in a streaming setting. 

1.2 New results 

In this paper we improve existing results in cases where 
the matrix product is "compressible" — in fact, we produce 
a compressed representation of AB. Let N < 2n 2 denote 
the number of nonzero entries in A and B. We obtain an 
approximation C by a combinatorial algorithm running in 
time 0(N + nb), making a single pass over the input while 
using space 0(b\gn), such that: 

• If AB has at most b nonzero entries, C = AB whp. 

• If AB has Frobenius norm q when removing its b largest 
entries, the error of each entry is bounded, whp., by 



\Ci: 



(AB)ij\ < q/Vb 



Compared to Cohen and Lewis [To| we avoid the restric- 
tion that input matrices cannot contain negative entries. 
Also, their method will produce only an approximate result 



1 The Frobenius norm of a matrix A is defined as 



\A\ 



AB : 



(1) 



even when AB is sparse. Finally, their method inherently 
uses space 6>(n 2 ), and hence is not able to exploit compress- 
ibility to achieve smaller space usage. 

Our algorithm is faster than exi stin g matrix multiplication 
algorithms for sparse outputs [4] [20] whenever b < n 6 ^ 5 , as 
well as in situations where a large number of cancellations 
mean that b <C b. As a more conceptual contribution it 
is to our knowledge the only "simple" algorithm to signif- 
icantly improve on Strassen's algorithm for sparse outputs 
with many entries that are not trivially zero. 

The simple, combinatorial algorithms derived from Drineas 
et al. 
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and Sarlos 23 yield error guarantees that are gen- 
erally incomparable with those achieved here, when allow- 
ing same time bound, i.e., c = G(6/n). The Frobenius error 
bound we achieve is: 

\\AB- C\\ F < \\AB\\ F ^/nfc . 

Our result bears some similarity to the result of Iwen and 
Spencer 19 , since both results be seen as compressed sens- 



ing of the product AB. One basic difference is that Iwen and 
Spencer perform compressed sensing on each column of AB 
(n sparse signals), while we treat the whole matrix AB as a 
single sparse signal. This means that we are robust towards 
skewed distribution of large values among the columns. 

2. ALGORITHM AND ANALYSIS 

We will view the matrix AB as the set of pairs 
where the weight of item (i, j) is (AB)ij. Our approach is 
to compute a linear sketch p ak b k for each outer product of 
(fll), and then compute ^2 k Pa k b k to obtain a sketch for AB. 
For exposition, we first describe how to compute an AMS 
sketch of this weighted set in time 0(n), and then extend 
this to the more accurate Count Sketch. In the following 
we use [n] to denote the set {1, . . . , n}. 

2.1 AMS sketches 

Alon, Matias, and Szegedy [2] described and analyzed the 
following approach to sketching a data stream Z\,Za,..., 
where item z% £ [n] has weight wf. Take a 2- wise inde- 
pendent function s : [n] — > {— 1, +1} (which we will call 
the sign function), and compute the sum X = £\ s(zi)uii 
(which we refer to as the AMS sketch). We will use a sign 
function on pairs (i,j) that is a product of sign functions 
011 the coordinates: s(i,j) = si(i)s->(j). Indyk and McGre- 
gor [l8], and Braverman et al. [H] have previously analyzed 
moments of AMS sketches with hash functions of this form. 
However, for our purposes it suffices to observe that s(i,j) 
is 2- wise independent if si and S2 are 2- wise independent. 
The AMS sketch of an outer product uv, where by definition 
{uv)ij = UiVj, is: 



(i,j)S[n]x[n] 



s{i,j){uv)ij= ^si(i)ui [J2s 2 (j)v 



j = i 



That is, the sketch for the outer product is simply the prod- 
uct of the sketches of the two vectors (using different hash 
functions). A single AMS sketch has variance roughly ||uv||f., 
which is too large for our purposes. Taking the average of 



2 We will use "h is a a fc-wise independent hash function" as 
a shorthand for "h is chosen uniformly at random from a 
fc-wise independent family of hash functions, independently 
of all other random choices made by the algorithm". 



b such sketches to reduce the variance by a factor ■s/b would 
increase the time to retrieve an estimator for an entry in the 
matrix product by a factor of b. By using a different sketch 
we can avoid this problem, and additionally get better esti- 
mates for compressible matrices. 

2.2 Count sketches 

Our algorithm will use the Count Sketch of Charikar, 
Chen and Farach-Colton m, which has precision at least as 
good as the estimator obtained by taking the average of b 
AMS sketches, but is much better for skewed distributions. 
The method maintains a sketch of any desired size b, using a 
2- wise independent splitting function h : [n] — > {0, . . . , 6— 1} 
to divide the items into b groups. For each group, an AMS 
sketch is maintained using a 2-wise independent hash func- 
tion s : [n] — > {— 1,+1}. That is, the sketch is the vector 
(c , . . . , Cb-i) where c k = £\ , l(Zj)=fc s(zi) Wi. An unbiased 
estimator for the total weight of an item z is Ch( z )s(z). To 
obtain stronger guarantees, one can take the median of sev- 
eral estimators constructed as above (with different sets of 
hash functions) — we return to this in section 3.3 



Sketching a matrix product naively. 

Since Count Sketch is linear we can compute the sketch 
for AB by sketching each of the outer products in |T]), and 
adding the sketch vectors. Each outer product has 0(n 2 ) 
terms, which means that a direct approach to computing its 
sketch has time complexity 0(n 2 ), resulting in a total time 
complexity of 0(n 3 ). 

Improving the complexity 

We now show how to improve the complexity of the outer 
product sketching from 0(n 2 ) to 0(n + blgb) by choosing 
the hash functions used by CountSketch in a "decompos- 
able" way, and applying FFT. We use the sign function s(i, j) 
defined in section [2~T| and similarly decompose the function 
h as follows: h(i,j) = h\(i) + /12O) mod b, where hi and h,2 
are chosen independently at random from a 3-wise indepen- 
dent family. It is well-known that this also makes h 3-wise 
independent [6j[22]. Given a vector u £ R" and functions 
h t : [n] — > {0, 1}, s t : [n] — > { — 1, 4-1} we define the 

following polynomial: 



ht(i) 



*(x) = ^ S t (i) Uj X 



The polynomial can be represented either in the standard 
basis as the vector of coefficients of monomials x° , . . . ,x b ~ 1 , 
or in the discrete Fourier basis as the vector 



(p^^ ),^'^ 1 ), 



where u is a complex number such that ui =1. The ef- 
ficient transformation to the latter representation is known 
as the fast Fourier transform (FFT), and can be computed 
in time C(61og6) when b is a power of 2 11 . Taking 
componentwise products of the vectors representing Pu 1,si 
and Pv 2 - 82 in the Fourier basis we get a vector p* where 
Pt = Pu 1 '" 1 (^') Pv 2 ' S2 {^)- Now consider the following poly- 
nomial: 



Plv(x) = ^ s(i,j) (uv)ij x k 



h(i,j) = k 



Using a/ h(! ' j) = td'MO+tfcaCi) we have that 



h(i,j)t 



^2 Si (i) Ui cj" 11 W ^ s 2 (j) ^ w 



= Pt ■ 

That is, p* is the representation, in the discrete Fourier 
basis, of p*„. Now observe that the coefficients of p%, v (x) 
are the entries of a Count Sketch for the outer prod- 
uct uv using the sign function s(i,j) and splitting function 
h(i,j). Thus, applying the inverse FFT to p* we compute 
the Count Sketch of uv. 

The pseudocode of figure [T] called with parameter d = 1, 
summarizes the encoding and decoding functions discussed 
so far. For simplicity the pseudocode assumes that the hash 
functions involved are fully random. A practical implemen- 
tation of the involved hash functions is character-based tab- 
ulation [22], but for the best theoretical space bounds we 
use polynomial hash functions |13| . 

Time and space analysis. 

We analyze each iteration of the outer loop. Computing 
Puv(x) takes time 0(n + blgb), where the first term is the 
time to construct the polynomials, and the last term is the 
time to multiply the polynomials, using FFT [11 . Comput- 
ing the sketch for each outer product and summing it up 
takes time 0(n 2 + nfelgfc). Finally, in time 0(n 2 ) we can 
obtain the estimate for each entry in AB. 

The analysis can be tightened when A and B are sparse or 
rectangular, and it suffices to compute the sketch that allows 
random access to the estimate C. Suppose that A is m-by- 
n 2 , and B is ri2-by-n3, and they contain TV <C n 2 nonzero 
entries. It is straightforward to see that each iteration runs 
in time 0(N + n 2 b\gb), assuming that A and B are given 
in a form that allows the nonzero entries of a column (row) 
to be retrieved in linear time. 

The required hash functions, with constant evaluation time, 
can be stored in space 0(d) using polynomial hash func- 
tions [13]. The space required for the rest of the compu- 
tation is 0(db), since we are handling 0(d) polynomials of 
degree less than b. Further, access to the input is restricted 
to a single pass if A is stored in column-major order, and B 
is stored in row-major order. 

Lemma 1. CompressedProduct(v4, B, b, d) runs in time 
0(d(N + n 2 blgb)), and uses space for O(db) real numbers, 
in addition to the input. 

We note that the time bound may be much smaller than n 2 , 
which is the number of entries in the approximate product C. 
This is because C is not constructed explicitly. In section [4] 
we address how to efficiently extract the b largest entries 
of C. 



3. ERROR ANALYSIS 

We can obtain two kinds of guarantees on the approxi 
mation: One in terms of the Frobenius norm (section 



function CompressedProduct(^4, B, b, d) 
for t := 1 to d do 

si[t],s 2 [i] €fl Maps({l,...,n} -> {-1,+1}) 
h 1 [t],h 2 [t] £ R Maps({l,...,n} {0, . . . , b - 1}) 
p[t] := 

for k := 1 to n do 
{pa,pb) := (0,0) 
for i := 1 to n do 

pa\h x (i)\ :=pa[hi(i)] + si [t] (i) A ik 
end for 

for j := 1 to n do 

pa[h 2 (j)} := P b[h 2 (j)} + aa [*](?') B ^ 
end for 

(pa,pb) := (FFT (pa), FFT(pfe)) 
for z := 1 to b do 

p[t][z] := P [t][z}+pa[z]pb[z} 
end for 
end for 
end for 
end for 

for t := 1 to d do p[t] := FFT 1 (p[t]) end for 
return (p, si, s 2 , hi, /12) 
end 

function Decompress(i, j) 
for t := 1 to d do 

X t :=si[t]{i)s 2 \t](j) P [t][(hi(i) + h 2 (j)) mod 6] 
return Median(Xi, . . . , X d ) 
end 

Figure 1: Method for encoding an approximate 
representation of AB of size bd, and corresponding 
method for decoding its entries. Maps(D — > C) de- 
notes the set of functions from D to C, and Gj? de- 
notes independent, random choice. (Limited ran- 
domness is sufficient to obtain our guarantees, but 
this is not reflected in the pseudocode.) We use 
to denote a zero vector (or array) of suitable size 
that is able to hold complex numbers. FFT(p) de- 
notes the discrete fourier transform of vector p, and 
FFT -1 its inverse. 



considers the application of our result to covariance matrix 
estimation. 

3.1 Frobenius norm guarantee 



Theorem 2. For d = 1 and any (i*,j*), the function 
call Decompress(i*, j*) computes an unbiased estimator for 
(AB)i'j' with variance bounded by \ \ AB\ \ 2 p/b. 

Proof. For i,j € {l,...,n} let Kij be the indicator 
variable for the event h(i,j) = h(i*,j*). Since h is 3-wise 
independent these events are 2-wise independent. We can 
write X as: 



X = s(i*,f) ^2Kij a (i,j)(AB)i 



3.11 



which applies even if we use just a single set of hash func- Observe that K(i* ,j*) = 1, E[s(i*, j*)s(i, j)] — whenever 
tions (d = 1), and stronger guarantees (section 3.3 1 that (*>j) 7^ (i*>3*)> an d E[s(i*,j*) 2 ] = 1. This implies that 
require the use of d = 0(lgn) hash functions. Section [3.2| E[X] — (AB)i*j*. To bound the variance of X we rewrite 



it as: 

X=(AB) i . j .+s(i*,j*) K t:J s(i,j)(AB) tJ (2) 

Since s(i*,j*) 2 = 1 and the values Kij, i,j € {l,...,n}, 
and s(i, j), i, j G {1, . . . , n} are 2-wise independent the terms 
have covariance zero, so the variance is simply the sum 

£ Var(iC i , ja (i,j)(AB) 4J ) . 

We have that E[Kijs(i, j)(AB)ij] — 0, so the variance of 
each term is equal to its second moment: 

E[(K iljS {i,j){AB)^ 2 ] = (AB)lB[K itj ] = (AB%/b . 

Summing over all terms we get that Var(X) < 1 1 AB\ \ F /b. □ 

As a consequence of the lemma, we get from Chebychev's 
inequality that each estimate is accurate with probability 
3/4 up to an additive error of 2\\AB\\ F /Vb. For sufficiently 
large d — O(lgn) this holds for all entries with high proba- 
bility, by Chernoff bounds. 

3.2 Covariance matrix estimation 

We now consider the application of our result to covari- 
ance matrix estimation from a set of samples. The covari- 
ance matrix captures pairwise correlations among the com- 
ponents of a multivariate random variable X = {X\ , . . . , X n ) T 
It is defined as cov(X) = E[(X -E[X])(X -E[X]) T ]. We can 
arrange observations Xi , . . . , x m of X as columns in an n-by- 
m matrix A. Figure [2] illustrates how approximate matrix 
multiplication can be used to find correlations among rows 
of A (corresponding to components of X). In the following 
we present a theoretical analysis of this approach. 

The sample mean of X is x — — ^2^ =1 x i- The sample 
covariance matrix Q is an unbiased estimator of cov(X), 
given by: 

m 
8=1 

Let xl denote the n-by-m matrix that has all columns equal 
to x. Then we can write Q as a matrix product: 

Q= ^(A-xl^A-xlf . 

To simplify calculations we consider computation of Q 
which is derived from Q by setting entries on the diagonal 
to zero. Notice that a linear sketch of Q can be transformed 
easily into a linear sketch of Q, and that a sketch of Q also 
allows us to quickly approximate Q. Entry Qij, i 7^ j is a 
random variable that has expectation if Xi and Xj are 
independent. It is computed as — times the sum over m 
observations of (Xi — E[Xi])(Xj — E[ Xj\) . Assuming inde- 
pendence and using the formula from [IT], this means that 
its variance is ™ a Var(X i )Var(X i ) <^Var(Xi)Var(A», 
for m > 2. If cov(A) is a diagonal matrix (i.e., every pair of 
variables is independent), the expected squared Frobenius 



norm of Q is: 

E[||Q|||] =2^E[Q? i ] = 2J2 Var(Qy) 

i<j i<j 




In a statistical test for pairwise independence one will as- 
sume independence, and test if the sample covariance matrix 
is indeed close to diagonal. We can derive an approximation 
guarantee from Theorem [2] for the sketch of Q (and hence 
Q), assuming the hypothesis that cov(X) is diagonal. If this 
is not true, our algorithm will still be computing an unbi- 
ased estimate of cov(X), but the observed variance in each 
entry will be larger. 

Theorem 3. Consider m observations of random vari- 
ables Xi , . . . , X n that are pairwise independent. We can 
compute in time 0((n + b)m) and space 0(b) an unbiased 
approximation to the sample covariance matrix with additive 
error on each entry (whp.) GQ2™ =1 Var(Xi)/Vmb). 

No similar result follows from the method of Cohen and 
Lewis [To], which gives no theoretical guarantees when ap- 
plied to matrices with negative entries. Similarly, the al- 
gorithms of Drineas et al. [14] and Sarlos [23] do not have 
sufficiently strong guarantees on the error of single entries 
to imply Theorem [3] We note that a similar result could be 
obtained by the method of Iwen and Spencer [19] . 

The special case of indicator random variables is of par- 
ticular interest. Then Var(Xi) < n, and if m > n we 
can achieve additive error o(l) in time slightly superlinear 
in the size of the input matrix: 

Corollary 4. Consider m observations of indicator ran- 
dom variables X\, . . . , X n that are pairwise independent. Us- 
ing space b > n and time 0(mb) we can compute an approx- 
imation to the sample covariance matrix with additive error 
0(n/Vmb). 

3.3 Tail guarantees 

We now provide a stronger analysis of our matrix mul- 
tiplication algorithm, focusing on the setting in which we 
compute d = O(lgn) independent sketches as described in 
section [2~2} and the estimator returned is the median. 

3.3.1 Sparse outputs. 

We first show that sparse outputs are computed exactly 
with high probability. 

Theorem 5. Suppose AB has at most 6/8 nonzero en- 
tries, andd> 61gn. Then CompressedProduct(A, B, b, d) 
together with Decompress correctly computes AB with prob- 
ability 1 — o(l). 

Proof. Let So denote the set of coordinates of nonzero 
entries in AB. Consider again the estimator |2| for (AB)i*j* . 
We observe that X 7^ (AB)i«j-» can only happen when 
7^ f° r some 7^ with (AB)ij 7^ 0, i.e., 

€ So and h(i,j) = h(i*,j*). Since h is 2-wise inde- 
pendent with range b and |So| < b/8, this happens with 
probability at most 1/8. The expected number of sketches 



Figure 2: Finding correlations using approximate matrix multiplication. Upper left: 100-by-100 random 
matrix A where all entries are sampled from [—1; 1], independently except rows 21 and 66 which are positively 
correlated. Upper right: AA T has mostly numerically small values off diagonals, expect entries (66,21) and 
(21,66) corresponding to the correlated rows in A. Lower left: Approximation of AA T output by our algorithm 
using b = 2000. Lower right: After subtracting the contribution of diagonal elements of AA T and thresholding 
the resulting approximation, a small set of entries remains that are "candidates for having a large value", 
including (66,21). 



with X 7^ (AB)i*j* is therefore at most d/8. If less than 
d/2 of the sketches have X 7^ (AB)i*j*, the median will be 
(AB)i*j*. By Chernoff bounds, the probability that the ex- 
pected value d/8 is exceeded by a factor 4 is (e 4_1 /4 4 ) d//1 ° — 
o(2~ d ^ 3 ) . In particular, if we choose d — 6 lg n then the prob- 
ability that the output is correct in all entries is 1 — o(l). □ 

3.3.2 Skewed distributions. 

Our next goal is to obtain stronger error guarantees in the 
case where the distribution of values in AB is skewed such 
that the Frobenius norm is dominated by the 6/20 largest 
entries. 

Let Err^(M) denote the squared Frobenius norm (i.e., 
sum of entries squared) of a matrix that is identical to matrix 
M except for its fc largest entries (absolute value), where it 
is zero. 

Theorem 6. Suppose that d > 61gn. Then Decom- 
press in conjunction with CompressedProduct(A, B, b, d) 
computes a matrix C such that for each entry dj we have 

\Cij - {AB) l3 \ < 12^ Err/ 20 (AB)/b 

with probability 1 — o(n~ 2 ). 

PROOF. Consider again equation ([2| that describes a sin- 
gle estimator for (AB)i*j* . Let v be the length n 2 — 1 vector 
with entries (Kij(AB)ij)ij , ranging over all (i,j) 7^ (i* , j*). 
The error of X is s(i* , j*) times the dot product of v and the 
±1 vector represented by s. Since s is 2- wise independent, 
Var(X) < 1 1 v 1 1 § . Let L denote the set of coordinates of the 
6/20 largest entries in AB (absolute value), different from 
(i* ,j*), with ties resolved arbitrarily. We would like to argue 
that with probability 9/10 two events hold simultaneously: 

• K(i,j) = for all g L. 

• MI2 < cr 2 , where a 2 = 20 E,rr b J 20 (AB)/b. 

When this is the case we say that v is "good". The first 
event holds with probability 19/20 by the union bound, since 
Pr[K(i,j) — 1] — 1/b. To analyze the second event we 
focus on the vector v' obtained from v by fixing K(i,j) — 
for (i,j) € L. The expected value of ||v'||| is bounded 
by a 2 /20. Thus by Markov's inequality we have ||v'||| < 
a 2 with probability 19/20. Note that when the first event 
holds we have v' = v. So we conclude that v is good with 
probability at least 9/10. 

For each estimator X, since Var(X) < ||v||| the proba- 
bility that the error is of magnitude t or more is at most 
I ! v 1 1 2 / * 2 by Chebychev's inequality. So for a good vector 
v the probability that t 2 > 7a' 2 > 7||v||jj is at most 1/49. 
Thus for each estimator the probability that it is based on 
a good vector and has error less than 

VTo^< 12 ^Eri b J 20 ( AB)/b 

is at least 1 - 1/10 - 1/49 > 7/8. 

Finally observe that it is unlikely that d/2 or more estima- 
tors have larger error. As in the proof of Theorem [5] we get 
that the probability that this occurs is o(2 _d//3 ) = o(n~ 2 ). 
Thus, the probability that the median estimator has larger 
error is o(n -2 ). □ 



4. SUBLINEAR RESULT EXTRACTION 

In analogy with the sparse recovery literature (see [IS] for 
an overview) we now consider the task of extracting the most 
significant coefficients of the approximation matrix C in time 
o(n 2 ). In fact, if we allow the compression algorithm to use 
a factor 0(lgn) more time and space, the time complexity 
for decompression will be 0(6 lg 2 n). Our main tool is error- 
correcting codes, previously applied to the sparse recovery 
problem by Gilbert et al. [16]. However, compared to [16] 
we are able to proceed in a more direct way that avoids 
iterative decoding. We note that a similar result could be 
obtained by a 2-dimensional dyadic decomposition of [n] x 
[n], but it seems that this would result in time 0(6 lg 3 n) for 
decompression. 

For A > let Sa = {(i, j) \ \(AB)ij\ > A} denote the set 
of entries in AB with magnitude larger than A, and let L 
denote the b/n largest entries in AB, for some constant k. 
Our goal is to compute a set S of 0(6) entries that con- 
tains L n Sa where A = O (\J Err b J * (AB)/b). Intuitively, 
with high probability we should output entries in L if their 
magnitude is significantly above the standard deviation of 
entries of the approximation C. 

4.1 Approach 

The basic approach is to compute a sequence of Count 
Sketches using the same set of hash functions to approxi- 
mate different submatrices of AB. The set of sketches that 
contain a particular entry will then reveal (with high prob- 
ability) the location of the entry. The submatrices are con- 
structed using a good error-correcting code 

E:[n]^{0,lY, 

where I = 0(lgn). Let Ie t denote the diagonal matrix 
where entry (i,i) equals E(i) r , bit number r of E(i). Then 
Ie t A is the matrix that is derived from A by changing entries 
to zero in those rows i for which E(i) r = 0. Similarly, we can 
derive BIe t from B by changing entries to zero in columns 
j where E(j) r = 0. The matrix sketches that we compute 
are: 

C r . = (I Er A)B, forre={l,...,fh and 
C. r = A{BI Er _ t ), for r 6 {£ + 1, . . . , 2£} 

We aim to show the following result. 

Theorem 7. Assume d = 0(lgn) is sufficiently large. 

There exists a constant k such that if A > k \J Err b J K (AB) jb 
then FindSignificantEntries(A) returns a set ofOib) po- 
sitions that includes the positions of the b/n entries in AB 
having the highest magnitudes, possibly omitting entries with 
magnitude below A. The running time is 0(61g 2 n), space 
usage is 0(61gn), and the result is correct with probability 
1 — A. 

Combining this with Theorem [5] and Lemma [l] we obtain: 

Corollary 8. Let A be an n\-by-n-z matrix, and B an 
n^-by-nz matrix, with N nonzero entries in total. Further 
suppose that AB is known to have at most b nonzero entries. 
Then a sparse representation of AB, correct with probability 
1 — can be computed in time 0(N + n2&lg6 + 61g 2 n), 
using space 0(6 lgn) in addition to the input. 



function FindSignificantEntries(A) 
S :=0 

for t := 1 to d do 
for fc := 1 to 6 do 

for r := 1 to 2^ do X r := jp (t ' r) [fc]| end for 
s := e 

for r := 1 to 21 do 

if X r > A/2 then s := s[|l else s := s||0 
end for 

(i, j) :=DECODE(s) 
INSERt((z, j),S) 
end for 
end for 
for G S do 

if |{(i,i)} n 5| < d/2 then Delete((j, j), S) end if 
end for 
return S 
end 



Figure 3: Method for computing the positions 
of 0(b) significant matrix entries of magnitude A or 
more. String concatenation is denoted ||, and e de- 
notes the empty string. Decode(s) decodes the (cor- 
rupted) codewords formed by the bit string s (which 
must have length a multiple of I), returning an arbi- 
trary result if no codeword is found within distance 
8i. Insert(:t, S) inserts a copy of x into the multiset 
S, and Delete(:e, S) deletes all copies of x from S. 
FindSignificantEntries can be used in conjunction 
with Decompress to obtain a sparse approximation. 



4.2 Details 

We now fill in the details of the approach sketched in 
section |4.1| Consider the matrix sketches of Q . We use 



4.2.1 Proof of theorem^ 



to denote polynomial t in the sketch number r, for 



... . 

r = l,...,2£. 

For concreteness we consider an expander code [24], which 
is able to efficiently correct a fraction 8 = fi(l) errors. Given 
a string within Hamming distance 81 from E(x), the input 
x can be recovered in time O(l), if the decoding algorithm 
is given access to a (fixed) lookup table of size 0(n). (We 
assume without loss of generality that St is integer.) 

Pseudocode for the algorithm computing the set of po- 
sitions (i, j) can be found in figure [3] For each splitting 
function h^(i,j) = (h^ (i) + (J)) mod b, and each hash 
value k we try to recover any entry £ L n Sa with 

h (t) (i,j) = k. The recovery will succeed with good proba- 
bility if there is a unique such entry. As argued in sect ion [3l^ 
we get uniqueness for all but a small fraction of the splitting 
functions with high probability. 

The algorithm first reads the relevant magnitude X r from 
each of the 21 sketches. It then produces a binary string s 
that encodes which sketches have low and high magnitude 
(below and above A/2, respectively). This string is then 
decoded into a pair of coordinates that are inserted in 

a multiset S. 

A post-processing step removes "spurious" entries from S 
that were not inserted for at least d/2 splitting functions, 
before the set is returned. 
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and a hash 



It is easy to see that FindSignificantEntries can be im- 
plemented to run in expected time 0(db£), which is C(blg 2 n), 
and space O(db), which is 0(6 lgn). The implementation 
uses the linear time algorithm for Decode 
table to maintain the multiset S. 

Also, since we insert db positions into the multiset S, and 
output only those that have cardinality d/2 or more, the set 
returned clearly has at most 2b distinct positions. It remains 
to see that each entry 6 L n Sa is returned with high 
probability. Notice that 



({lE r A)B) t 



{A{BlE r ))i 



if E(i) r = 1 
otherwise 



1° 

f(AB)«i ifE(j) r = l 
1 otherwise 



Therefore, conditioned on = k we have that the 

random variable X r = |p (t ' r) [fc]| has E[X r ] G {0,\{AB) ZJ \}, 
where the value is determined by the rth bit of the string 
s — E(i)\\E(j). The algorithm correctly decodes the rth bit 
if X r < A/2 for s r = 0, and X r > A/2 for s r = 1. In 
particular, the decoding of a bit is correct if (AB)y > A 
and X r deviates by at most A/2 from its expectation. 

From the proof of Theorem [6] we see that the proba- 
bility that the error of a single estimator is greater than 

12 -\J Err^/ 20 (AB) / b is at most 1/8. If A is at least twice as 
large, this error bound implies correct decoding of the bit 
derived from the estimator, assuming (AB)i*j* > A. Ad- 
justing constants 12 and 20 to a larger value k the error 
probability can be decreased to (5/3. This means that the 
probability that there are 81 errors or more is at most 1/3. 
So with probability 2/3 Decode correctly identifies (i* ,j*), 
and inserts it into S. 

Repeating this for d different hash functions the expected 
number of copies of in S is at least 2d/3, and by 

Chernoff bounds the probability that there are less than d/2 
copies is 2~ n( - d \ For sufficiently large d = O(lgn) the prob- 
ability that any entry of magnitude A or more is missed is 
less than 1/n. 

5. ESTIMATING COMPRESSIBILITY 

To apply theorems [5] and [6] it is useful to be able to com- 
pute bounds on compressibility of AB. In the following sub- 
sections we consider, respectively, estimation of the number 
of nonzero entries, and of the Err_F value. 

5.1 Number of nonzero entries 

An constant-factor estimate of b > b can be computed in 
time O(NlgN) using Cohen's method [9] or its refinement 
for matrix products ^5]. Recall that b is an upper bound on 
the number of nonzeros, when not taking into account that 
there may be zeros in AB that are due to cancellation of 
terms. We next show how to take cancellation of terms into 
consideration, to get a truly output-sensitive algorithm. 

The idea is to perform a doubling search that terminates 
(whp.) when we arrive at an upper bound on the number 
of nonzero entries that is within a factor 0(1) from the true 
value. Initially, we multiply A and B by random diagonal 
matrices (on left and right side, respectively). This will not 
change the number of nonzero entries, but by the Schwartz- 



Zippel lemma it ensures that a linear combination of entries 
in AB is zero (whp.) only when all these entries are zero. 

The doubling search creates sketches for AB using 6 — 
2, 4, 8, 16, . . . until, say, |& entries of the sketch vector be- 
come zero for all hash functions hs 1 ', . . . , h^ d \ Since there is 
no cancellation (whp.), this means that the number of dis- 
tinct hash values (under h(i,j)) of nonzero entries is 
at most 6/5. 

We wish to bound the probability that this happens when 
the true number 6 of nonzero entries is larger than b. The 
expected number of hash collisions is ( 2 )/&- If the number 
of distinct hash values of nonzero entries is at most 6/5 the 
average number of collisions per entry is 6/(6/5) — 1. This 
means that, assuming 6 > 6, the observed number of colli- 
sions can be lower bounded as: 



6 

b/5 



1 /2> 



— i/2> 
6/5 s/ " 



26 
b+1 



/b 



Note that the observed value is a factor > 4/3 larger 
than the expectation. So Markov's inequality implies that 
this happens with probability at most 3/4. If it happens 
for all d hash functions, we can conclude with probability 
1 - (3/4) - d that 6 is an upper bound on the number of 
nonzero entries. 

Conversely, as soon as 6/5 exceeds the number 6 of nonzero 
entries we are guaranteed to finish the doubling search. This 
means we get a 5-approximation of the number of non-zeros. 

5.2 Upper bounding En b J K (AB) 

To be able to conclude that the result of FindSignificant- 
Entries is correct with high probability, using theorem [7] 

we need an upper bound on \J Err^/ K (AB) /&. We will in 
fact show how to estimate the larger value 



Err£(AB)/&= \\AB\\ F /Vb, 



so the allowed value for A found may not be tight. We leave 
it as an open problem to efficiently compute a tight upper 

bound on y / En> /,c (AB)/6. 

The idea is to make use of the AMS sketch X of AB 
using the approach described in section |2.1| (summing the 
sketches for the outer products) . If we use 4- wise indepen- 
dent hash functions Si and S2, Indyk and McGregor [18] (see 
also Braverman et al. [5] for a slight correction of this result) 
have shown that X 2 is an unbiased estimator for the sec- 
ond moment of the input, which in our case equals ||AB||J., 
with variance at most 8E[X 2 ] 2 . By Chebychev's inequality 
this means that X 2 is a 32-approximation of ||j4B||f? with 
probability 3/4. (Arbitrarily better approximation can be 
achieved, if needed, by taking the mean of several estima- 
tors.) To make sure that we get an upper bound with high 
probability we take the median of d — O(logn) estimators, 
and multiply this value by 32. The total time for computing 
the upper bound is 0(dn 2 ). 

6. CONCLUSION 

We have seen that matrix multiplication allows surpris- 
ingly simple and efficient approximation algorithms in cases 
where the result is sparse or, more generally, its Frobenius 
norm is dominated by a sparse subset of the entries. Of 



course, this can be combined with known reductions of ma- 
trix multiplication to (smaller) matrix products 14 23 
yield further (multiplicative error) approximation results 



to 
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